High-Value Token-Blocking: Efficient Blocking Method for Record Linkage

Abstract

Data integration is an important component of Big Data analytics. One of the key challenges in data integration is record linkage, that is, matching records that represent the same real-world entity. Because of computational costs, methods referred to as blocking are employed as a part of the record linkage pipeline in order to reduce the number of comparisons among records. In the past decade, a range of blocking techniques have been proposed. Real-world applications require approaches that can handle heterogeneous data sources and do not rely on labelled data. We propose high-value token-blocking (HVTB), a simple and efficient approach for unsupervised and schema-agnostic blocking, based on a crafted use of Term Frequency-Inverse Document Frequency. We compare HVTB with multiple methods over a range of datasets, including a novel unstructured dataset composed of titles and abstracts of scientific papers. We thoroughly discuss the results in terms of accuracy, use of resources, and different characteristics of the datasets and methods. The simplicity of HVTB yields fast computations and does not harm its accuracy when compared with existing approaches. It is shown to be significantly superior to other methods, suggesting that simpler methods should be considered before resorting to more sophisticated methods.
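
The abstract only states that HVTB is schema-agnostic and based on a crafted use of TF-IDF; it does not give the algorithm itself. The following is therefore a minimal sketch of the general idea of TF-IDF-driven token blocking, not the published HVTB method. The function names (tokenize, high_value_token_blocks, candidate_pairs), the top_k parameter, and the plain tf * log(n/df) weighting are all assumptions made for illustration.

```python
import math
import re
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercase a record and split it into alphanumeric tokens, ignoring any schema."""
    return re.findall(r"[a-z0-9]+", text.lower())


def high_value_token_blocks(records, top_k=3):
    """Build blocks keyed by each record's top_k highest TF-IDF tokens.

    top_k is a hypothetical parameter, not taken from the paper.
    """
    token_lists = [tokenize(r) for r in records]
    n = len(records)

    # Document frequency: number of records that contain each token.
    df = defaultdict(int)
    for tokens in token_lists:
        for tok in set(tokens):
            df[tok] += 1

    blocks = defaultdict(set)
    for rec_id, tokens in enumerate(token_lists):
        if not tokens:
            continue
        scores = {}
        for tok in set(tokens):
            tf = tokens.count(tok) / len(tokens)
            idf = math.log(n / df[tok])
            scores[tok] = tf * idf
        # Keep only the highest-scoring tokens as blocking keys;
        # ties are broken alphabetically so the result is deterministic.
        best = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
        for tok in best:
            blocks[tok].add(rec_id)

    # A block holding a single record produces no comparisons, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct record pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Calling candidate_pairs(high_value_token_blocks(records)) on a list of raw record strings returns the pairs that a downstream matcher would compare, instead of all n(n-1)/2 pairs. Restricting the keys to a few high-scoring tokens per record is the design choice that keeps blocks small: very common tokens receive a low weight and rarely become blocking keys.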

Similar resources

Learning Blocking Schemes for Record Linkage

Record linkage is the process of matching records across data sets that refer to the same entity. One issue within record linkage is determining which record pairs to consider, since a detailed comparison between all of the records is impractical. Blocking addresses this issue by generating candidate matches as a preprocessing step for record linkage. For example, in a person matching problem, ...

Secure Blocking + Secure Matching = Secure Record Linkage

Performing approximate data matching has always been an intriguing problem for both industry and academia. This task becomes even more challenging when the requirement of data privacy rises. In this paper, we propose a novel technique to address the problem of efficient privacy-preserving approximate record linkage. The secure framework we propose consists of two basic components. First, we uti...

Towards Parameter-free Blocking for Scalable Record Linkage

Linking or matching databases is becoming increasingly important in many data mining projects, as linked data can contain information that is not available otherwise, or that would be too expensive to collect. A main challenge when linking large databases is the complexity of the linkage process: potentially each record in one database has to be compared with all records in the other database. ...

A Comparison of Blocking Methods for Record Linkage

Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality s...

Leveraging Unlabeled Data to Scale Blocking for Record Linkage

Record linkage is the process of matching records between two (or multiple) data sets that represent the same real-world entity. An exhaustive record linkage process involves computing the similarities between all pairs of records, which can be very expensive for large data sets. Blocking techniques alleviate this problem by dividing the records into blocks and only comparing records within the...

Journal

Journal title: ACM Transactions on Knowledge Discovery from Data

Year: 2021

ISSN: 1556-472X, 1556-4681

DOI: https://doi.org/10.1145/3450527